AITopics

Country: Europe (0.46)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing SystemsJun-18-2026, 02:01:39 GMT

Flat Channels to Infinity in Neural Loss Landscapes

The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, ai and aj, diverge to infinity, and their input weight vectors, wi and wj, become equal to each other. At convergence, the two neurons implement a gated linear unit: aiσ(wi x) + ajσ(wj x) cσ(w x) + (v x)σ (w x). Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.

artificial intelligence, machine learning, saddle line, (16 more...)

Country:

North America > United States (0.28)
Europe (0.28)
Asia (0.28)

Genre: Research Report > Experimental Study (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Neural Information Processing SystemsJun-17-2026, 03:31:37 GMT

ATale of Two Symmetries: Exploring the Loss Landscape of Equivariant Models

Equivariant neural networks have proven to be effective for tasks with known underlying symmetries. However, optimizing equivariant networks can be tricky and best training practices are less established than for standard networks. In particular, recent works have found small training benefits from relaxing equivariance constraints. This raises the question: do equivariance constraints introduce fundamental obstacles to optimization? Or do they simply require different hyperparameter tuning?

artificial intelligence, machine learning, representation, (17 more...)

Country: North America > United States > Massachusetts (0.28)

Genre: Research Report > Experimental Study (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Azaïs, Jean-Marc, Dalmao, Federico, De Castro, Yohann

Non-asymptotic Tail Bounds for the Kostlan--Shub--Smale Field: Tensor PCA and Spherical $k$-Spin Complexity

arXiv.org Machine LearningJun-17-2026

This paper builds a hierarchy of explicit, non-asymptotic tail bounds for the supremum of the Kostlan--Shub--Smale (KSS) random field on the sphere, and applies it to two problems: Spiked Tensor PCA and the landscape of the spherical $k$-spin model. For Tensor PCA, we study the non-asymptotic statistical limits of estimating a rank-$R$ symmetric signal tensor of order~$k\ge 3$ and dimension~$d\ge 3$ from a single Gaussian observation at signal-to-noise ratio~$λ$, through the \emph{profile maximum likelihood estimator}, the MLE restricted to normalized rank-$R$ tensors of coherence at least~$κ$. Our analysis uses a single reduction: a deterministic geometric inequality (the Tube Method) and a rank-reduction step bound the estimation error by the supremum of the canonical KSS field, which the Kac--Rice formula turns into a Gaussian integral against the expected absolute characteristic polynomial of a shifted Gaussian Orthogonal Ensemble, controlled in turn by the four explicit tail bounds of our hierarchy (three from a Mehta--Fyodorov representation, one from a Ben Arous--Dembo--Guionnet large deviation). The same reduction yields two results, each with explicit constants. For estimation, a finite-$(k,d)$ error bound recovers the asymptotically optimal rate~$\sqrt{d\log k}$ of Perry, Wein and Bandeira, with explicit dependence on the rank~$R$ and the coherence~$κ$. For the landscape, a two-sided non-asymptotic bracketing of the annealed complexity of the spherical $k$-spin Hamiltonian recovers the Auffinger--Ben Arous--Černý complexity function in the high-dimensional limit.

artificial intelligence, bayesian inference, machine learning, (20 more...)

2606.17665

Country:

North America > Canada (0.45)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.45)
Europe > France (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

Neural Information Processing SystemsJun-16-2026, 02:38:05 GMT

ACloser Look at NTKAlignment: Linking Phase Transitions in Deep Image Regression

Deep neural networks trained with gradient descent exhibit varying rates of learning for different patterns. However, the complexity of fitting models to data makes direct elucidation of the dynamics of learned patterns challenging. To circumvent this, many works have opted to characterize phases of learning through summary statistics known as order parameters. In this work, we propose a unifying framework for constructing order parameters based on the Neural Tangent Kernel (NTK), in which the relationship with the data set is more transparent. In particular, we derive a local approximation of the NTK for a class of deep regression models (SIRENs) trained to reconstruct natural images. In so doing, we analytically connect three seemingly distinct phase transitions: the emergence of wave patterns in residuals (a novel observation), loss rate collapse, and NTK alignment. Our results provide a dynamical perspective on the observed biases of SIRENs, and deep image regression models more generally.

artificial intelligence, deep learning, machine learning, (17 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Neural Information Processing SystemsJun-14-2026, 16:41:07 GMT

The Persistence of Neural Collapse Despite Low-Rank Bias

Neural collapse (NC) and its multi-layer variant, deep neural collapse (DNC), describe a structured geometry that occurs in the features and weights of trained deep networks. Recent theoretical work by Sukenik et al. using a deep unconstrained feature model (UFM) suggests that DNC is suboptimal under mean squared error (MSE) loss. They heuristically argue that this is due to low-rank bias induced by L2 regularization. In this work, we extend this result to deep UFMs trained with cross-entropy loss, showing that high-rank structures--including DNC--are not generally optimal. We characterize the associated low-rank bias, proving a fixed bound on the number of non-negligible singular values at global minima as network depth increases. We further analyze the loss surface, demonstrating that DNC is more prevalent in the landscape than other critical configurations, which we argue explains its frequent empirical appearance. Our results are validated through experiments in deep UFMs and deep neural networks.

artificial intelligence, machine learning, matrix, (15 more...)

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Mun, Kyunghoo, Rosenzweig, Matthew

Phase transitions for the noisy transformer model in arbitrary dimension

arXiv.org Machine LearningJun-4-2026

We study the McKean--Vlasov free energy on the unit sphere associated with the unnormalized self-attention (USA) model for noisy transformer dynamics. We prove a sharp global-minimizer dichotomy in every dimension $d\ge2$. There is a unique $β_*^{(d)}>0$ such that \begin{equation*} \frac{I_{d/2+1}(β_*^{(d)})}{I_{d/2}(β_*^{(d)})}=\frac1d, \end{equation*} where $I_ν$ is the modified Bessel function of the first kind. For $0<β\le β_*^{(d)}$, the uniform density remains the unique global minimizer up to the linear-stability threshold \begin{equation*} K_\#^{(d)}(β)=\frac{β^{d/2}}{2^{d/2}Γ(d/2)I_{d/2}(β)}, \end{equation*} and the phase transition is continuous. For $β>β_*^{(d)}$, the uniform density is not globally minimizing at $K_\#^{(d)}(β)$, so the critical coupling satisfies $K_c

artificial intelligence, machine learning, sd 1, (18 more...)

2606.0514

Country: North America > United States (0.66)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Lascu, Razvan-Andrei, Suzuki, Taiji

From Saddle Points Toward Global Minima: A Newton-Type Method on Wasserstein Space

arXiv.org Machine LearningMay-19-2026

We study the minimization of non-convex functionals over the Wasserstein space. While recent work has showed that perturbed Wasserstein gradient methods can avoid saddle points for benign landscapes, existing approaches remain essentially first-order and do not provide fast local convergence once the iterates enter a neighborhood of a global minimizer. We propose Wasserstein Saddle-Free Newton (WSFN), a second-order method that preconditions the Wasserstein gradient by a regularized square root of the squared Wasserstein Hessian. This construction preserves attraction toward directions of positive curvature while inducing repulsion along directions of negative curvature, thereby overcoming the tendency of standard Wasserstein Newton dynamics to be attracted to saddles. We also establish second-order sufficient optimality conditions on Wasserstein space for strict local minimality. Under regularity and benign landscape assumptions, we prove that WSFN escapes saddle regions and reaches an $α$-neighborhood of a global minimizer in polynomial time, with improved dependence on saddle parameters compared with prior perturbed first-order methods. Once inside this neighborhood, we show that WSFN converges linearly in $L^2$-Wasserstein distance to a non-degenerate global minimizer. Finally, we present a particle-based implementation of the method.

artificial intelligence, machine learning, minimizer, (17 more...)

2605.17963

Country:

Asia > Japan (0.28)
North America > United States (0.27)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Maulen-Soto, Rodrigo, Boyer, Claire

Attention-based PCA

arXiv.org Machine LearningMay-19-2026

We study attention mechanisms through the lens of a canonical unsupervised problem: principal component analysis (PCA). We show that, when trained on Gaussian data, both softmax and linear attention layers learn parameters that align with the principal eigenvectors of the covariance matrix, thereby establishing a direct and explicit connection with PCA. Our analysis covers both finite and infinite prompt regimes. In the infinite-prompt limit, we prove convergence to globally optimal solutions aligned with the leading spectral direction, while in the finiteprompt setting we show that the same behavior emerges up to sampling effects. We further extend the analysis to an in-context setting with spiked Wishart covariances, where attention successfully recovers the underlying signal direction. These results demonstrate that attention inherently performs PCA-like computations under unsupervised objectives, providing a theoretical foundation for its representation-learning capabilities.

artificial intelligence, machine learning, natural language, (19 more...)

2605.18315

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)

Neural Information Processing SystemsApr-30-2026, 08:09:14 GMT

f5ccb3ab757131a93586ef61ec701533-Supplemental-Conference.pdf

In this section, we compare the symmetric solutions found in erf [2] and ReLU networks [5] to our one-neuron solution (n =1). The main difference is that both earlier studies constrain the search space to the symmetric subspace whereas we first prove that the non-trivial critical points are contained in this subspace in Theorem 5.1 for a broad class of activation functions, including erf and ReLU. Solving the low-dimensional loss, we recover the same solution for ReLU and erf as in [2, 5] for unit-orthonormal teachers.

artificial intelligence, critical point, machine learning, (19 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.37)